Aggregate GPU task metrics in the profiling tool #2088
Emits three new long-format CSVs covering the 26 GPU task accumulators from GpuTaskMetrics.scala: gpu_stage_level_aggregated_task_metrics.csv, gpu_sql_level_aggregated_task_metrics.csv, and gpu_app_level_aggregated_task_metrics.csv. Metrics are auto-discovered by name (gpu*, perfio.s3.*, multithreadReaderMaxParallelism), and units are derived from the name (Time/Wait→ms, Bytes→bytes, else count). The SQL and app levels re-sum the stage rows. Emission is skipped when no GPU metrics are present. The job level is intentionally skipped: each Spark action is a job, so a job-level row would either duplicate the SQL row or be meaningless.
Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
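The name-based discovery and unit derivation described in this commit can be sketched roughly as below. This is a minimal illustration, not the code in the PR: the object name, the case-sensitive substring matching, and the precedence of Time/Wait over Bytes are all assumptions, and the metric names used in the usage note are illustrative only.

```scala
object GpuMetricNaming {
  // Assumed matching rule: the three name patterns listed in the commit
  // message (gpu*, perfio.s3.*, multithreadReaderMaxParallelism).
  def isGpuMetric(name: String): Boolean =
    name.startsWith("gpu") ||
      name.startsWith("perfio.s3.") ||
      name == "multithreadReaderMaxParallelism"

  // Assumed derivation: Time/Wait -> ms, Bytes -> bytes, anything else -> count.
  // Checking Time/Wait before Bytes is a choice made for this sketch.
  def unitFor(name: String): String =
    if (name.contains("Time") || name.contains("Wait")) "ms"
    else if (name.contains("Bytes")) "bytes"
    else "count"
}
```

Under these assumptions, `GpuMetricNaming.unitFor("gpuSemaphoreWait")` yields `"ms"`, and a name with none of the three keywords falls through to `"count"`.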
Adds appId as the leading column on gpu_app_level_aggregated_task_metrics.csv so downstream consumers can join by application without relying on the output directory path. Also bumps the copyright year on touched files to 2026 (the pre-commit hook's sed is BSD-incompatible on macOS and silently no-ops).
Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
ArrayBuffer.flatMap returns ArrayBuffer (mutable), which no longer auto-coerces to immutable.Seq under Scala 2.13. Materialize the per-SQL row collection as Seq before passing to rollupGpuRows, and use an explicit lambda for the inner flatMap.
Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
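The Scala 2.13 fix in this commit follows a common pattern, sketched below under stated assumptions. `Row`, `rollup`, and `sumForStages` are hypothetical stand-ins (the real `rollupGpuRows` signature is not shown in this PR text); only the `flatMap` with an explicit lambda followed by `.toSeq` reflects what the commit message describes.

```scala
import scala.collection.mutable.ArrayBuffer

object SeqCoercionSketch {
  // Hypothetical stand-in for one GPU metric row.
  final case class Row(stageId: Int, value: Long)

  // Hypothetical stand-in for rollupGpuRows: it accepts a Seq, which is
  // immutable.Seq on Scala 2.13.
  def rollup(rows: Seq[Row]): Long = rows.map(_.value).sum

  def sumForStages(stageIds: ArrayBuffer[Int], stageMap: Map[Int, Seq[Row]]): Long = {
    // flatMap on an ArrayBuffer yields another ArrayBuffer. Calling .toSeq
    // materializes it as a Seq so the call below compiles on both 2.12 and 2.13.
    val rows: Seq[Row] =
      stageIds.flatMap(id => stageMap.getOrElse(id, Seq.empty[Row])).toSeq
    rollup(rows)
  }
}
```

Passing the `ArrayBuffer` directly to `rollup` would compile on 2.12 but not on 2.13, which is why the conversion is made explicit.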
Greptile Summary
This PR adds long-format GPU task metric aggregations at three granularities (stage, SQL, app) by mining the existing GPU accumulators in
The two previously flagged concerns remain open: the
Confidence Score: 5/5
Safe to merge; all remaining findings are P2 style/consistency issues that don't affect correctness.
All logic reviewed: rolling-average usage of QualRawReportGenerator.scala - unconditional GPU label emission differs from Profiler.scala's nonEmpty guard.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[app.accumManager] -->|filter isGpuMetric| B[GPU Accumulators]
B -->|calculateAccStatsForStage| C[aggregateGpuMetricsByStage]
C -->|StageAggGpuMetricsProfileResult| D[gpuStageRows]
D -->|groupBy stageId| E[stageMap]
F[app.sqlIdToStages] --> G[aggregateGpuMetricsBySql]
E --> G
G -->|rollupGpuRows per SQL| H[SQLAggGpuMetricsProfileResult]
D -->|rollupGpuRows all stages| I[aggregateGpuMetricsByApp]
I --> J[AppAggGpuMetricsProfileResult]
D --> K[AggRawMetricsResult.gpuStageAggs]
H --> L[AggRawMetricsResult.gpuSqlAggs]
J --> M[AggRawMetricsResult.gpuAppAggs]
K -->|nonEmpty guard| N[gpu_stage_level_aggregated_task_metrics.csv]
L -->|nonEmpty guard| O[gpu_sql_level_aggregated_task_metrics.csv]
M -->|nonEmpty guard| P[gpu_app_level_aggregated_task_metrics.csv]
Reviews (3): Last reviewed commit: "Address greptile review on PR #2088"
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
# Conflicts:
#   core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AggRawMetricsResult.scala
#   core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAggTrait.scala
#   core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationSummaryInfo.scala
#   core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala
#   core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala
#   core/src/main/scala/com/nvidia/spark/rapids/tool/views/RawMetricProfView.scala
- Drop dead StageAggGpuMetricsProfileResult.aggregateStageProfileMetric. Stage attempts already merge upstream at the AccumInfo layer (stagesStatMap is keyed by stageId only, not stageId+attemptNumber), so a separate merge step on the case class is never invoked. Replaced the method with a comment explaining the upstream merging.
- Document the numTasks=0 invariant in aggregateGpuMetricsByStage and log a warning if the stage-task metrics cache lookup misses (which would silently distort the task-weighted avg at SQL/app level).
Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
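To see why a stage recorded with numTasks=0 distorts the SQL/app-level average, consider the sketch below. It assumes the rollup weights each stage's avg by its numTasks; that formula, the `StageRow` shape, and the `weightedAvg` helper are illustrative assumptions rather than the PR's actual implementation.

```scala
object WeightedAvgSketch {
  // Hypothetical stage-level row; field names mirror the stage CSV columns.
  final case class StageRow(stageId: Int, numTasks: Int, sum: Long, max: Long, avg: Double)

  // Assumed task-weighted average: each stage's avg counts in proportion
  // to its numTasks. A stage with numTasks = 0 contributes no weight.
  def weightedAvg(rows: Seq[StageRow]): Double = {
    val totalTasks = rows.map(_.numTasks.toLong).sum
    if (totalTasks == 0L) 0.0
    else rows.map(r => r.avg * r.numTasks).sum / totalTasks
  }
}
```

With stages of (numTasks=2, avg=10.0) and (numTasks=6, avg=2.0), this gives 4.0. If the second stage's task count were lost and recorded as 0, the result would become 10.0 with no error raised, which is the kind of silent distortion the warning is meant to surface.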
Contributes #2020
Changes
1. New GPU task-metric aggregation CSVs at three levels
Adds long-format aggregations for the 26 GPU task accumulators emitted by the RAPIDS plugin (GpuTaskMetrics.scala). Today these are only available raw in stage_level_all_metrics.csv.
- gpu_stage_level_aggregated_task_metrics.csv: stageId, numTasks, metricName, unit, sum, max, avg
- gpu_sql_level_aggregated_task_metrics.csv: sqlId, metricName, unit, sum, max, avg
- gpu_app_level_aggregated_task_metrics.csv: appId, metricName, unit, sum, max, avg
Note: numTasks appears only at the stage level, where it varies; at the SQL/app level it would be a constant per row (and is already in sql_level_aggregated_task_metrics.csv).
Example (SQL row):
4. Max-aggregated metrics: AccumMetaRef.METRICS_WITH_MAX_AGGREGATES extended from 4 → 9 entries. For these, sum and avg are emitted empty; only max is meaningful.
Testing
- AnalysisSuite: three new tests: rows produced for GPU log + rollup math; max-aggregated metrics carry only max; empty for CPU-only log.
- core/src/test/resources/spark-events-profiling/gpu_oom_eventlog.zstd: all three CSVs produced; cross-level math verifies.
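The max-only rule from item 4 and the cross-level rollup checked in testing can be sketched as follows. This is a simplified illustration under assumptions: `GpuRow`, `rollup`, and `csvLine` are hypothetical, the `maxOnly` set stands in for AccumMetaRef.METRICS_WITH_MAX_AGGREGATES, the avg column is omitted for brevity, and the metric names in the usage note are illustrative only.

```scala
object RollupSketch {
  // Hypothetical long-format row; sum is None for max-aggregated metrics.
  final case class GpuRow(metricName: String, sum: Option[Long], max: Long)

  // Re-aggregates stage rows per metric: max of maxes always, and a re-summed
  // total only for metrics that are not in the max-only set.
  def rollup(rows: Seq[GpuRow], maxOnly: Set[String]): Seq[GpuRow] =
    rows.groupBy(_.metricName).toSeq.sortBy(_._1).map { case (name, group) =>
      val maxValue = group.map(_.max).max
      val sumValue = if (maxOnly(name)) None else Some(group.flatMap(_.sum).sum)
      GpuRow(name, sumValue, maxValue)
    }

  // Renders a row as CSV; a missing sum becomes an empty cell.
  def csvLine(r: GpuRow): String =
    s"${r.metricName},${r.sum.map(_.toString).getOrElse("")},${r.max}"
}
```

For example, two stage rows of a counter with sums 3 and 4 roll up to a sum of 7, while two rows of a peak-memory style metric with maxes 100 and 250 roll up to a max of 250 and an empty sum cell.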